Descriptive statistics

Sven Rieger

December 1, 2023

Preface: Software I

  • The following packages are used:
descrPkg <- c("merTools",
              "sn",
              "knitr",
              "flextable",
              "psych",
              "ggplot2")
  • Install packages when not already installed:
lapply(descrPkg,
        function(x) 
          if(!x %in% rownames(installed.packages())) {
            install.packages(x)
            }
          )
  • Load (a subset of) the required package(s) into the R session.
library(ggplot2)
library(flextable)

Preface: Software II

Print list of packages and cite them via Pandoc citation.

```{r}
#| label: write-pkgs
#| code-fold: true
#| code-summary: "Show/hide fenced code"
#| output-location: fragment
#| output: asis

for (i in seq_along(descrPkg)) {
  
  cat(paste0(i, ". ",
             descrPkg[i],
             " [", "v", utils::packageVersion(descrPkg[i]),", @R-", descrPkg[i],
             "]\n"))
}
```
  1. merTools (v0.6.2, R-merTools?)
  2. sn (v2.1.1, R-sn?)
  3. knitr (v1.45, Xie, 2023)
  4. flextable (v0.9.5, Gohel & Skintzos, 2023)
  5. psych (v2.4.3, Revelle, 2023)
  6. ggplot2 (v3.4.4, Wickham et al., 2023)

Preface: Data matrix

Variables (e.g., characteristics), units (e.g., persons) and data (e.g., measurements) are often presented in matrix form. A matrix is a rectangular arrangement of \(n \cdot p\) quantities and looks as follows:

\[ \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix} \]

  • \(n\) rows; a single row is also known as a row vector or row matrix
  • \(p\) columns; a single column is also known as a column vector or column matrix
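
As a minimal sketch in R (the object name X and its values are made up for illustration), a data matrix with \(n = 3\) units and \(p = 2\) variables could be created and indexed like this:

```r
# Hypothetical data matrix: n = 3 units (rows), p = 2 variables (columns)
X <- matrix(c(1, 2, 3,
              4, 5, 6),
            nrow = 3, ncol = 2)   # matrix() fills column by column

dim(X)     # number of rows (n) and columns (p)
X[1, ]     # first row: all measurements of unit 1
X[, 2]     # second column: variable 2 for all units
X[3, 2]    # single element X_32
```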

Descriptive statistics: Overview

  • Measures of shape
    • Skewness (not covered in slides)
    • Kurtosis (not covered in slides)

Example

Consider the following example vector:

exVec <- c(1, 2, 5, 3, 8)

Mean

The mean (or arithmetic mean, average) is the sum of a collection of numbers divided by the count of numbers in the collection. The formula is given in Equation 1.

\[\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i=\frac{x_1+x_2+\dots+x_n}{n} \qquad(1)\]

For example, consider a vector of numbers: \(x = 1, 2, 5, 3, 8\)

\[\bar{x} = \frac{(1+2+5+3+8)}{5}=3.8\]

If the underlying data is a sample (i.e., a subset of a population), it is called the sample mean.

How to calculate the mean in R?

mean(exVec)
[1] 3.8

If there is missing data (in R denoted by NA), we set the argument na.rm to TRUE. To demonstrate this we create another example vector (exVec2).

exVec2 <- c(1, 2, 5, 3, 8, NA)
mean(exVec2)
[1] NA
mean(exVec2, na.rm = TRUE)
[1] 3.8
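
To connect Equation 1 with the R output, the mean can also be computed by hand. The following is only a sketch; the object names xBar and obs are introduced here for illustration.

```r
exVec <- c(1, 2, 5, 3, 8)

# Equation 1 by hand: sum of the values divided by their count
xBar <- sum(exVec) / length(exVec)
xBar

# With missing values, length() would still count the NA,
# so the NAs are dropped first (the effect of na.rm = TRUE)
exVec2 <- c(1, 2, 5, 3, 8, NA)
obs <- exVec2[!is.na(exVec2)]
sum(obs) / length(obs)
```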

Median

The median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as “the middle” value. The formulas are given in Equation 2.

\[ Mdn = \widetilde{x} = \begin{cases} x_{(n+1)/2} & \:\: \text{if } n \text{ is odd} \\ (x_{n/2} + x_{(n/2)+1}) / 2 & \:\: \text{if } n \text{ is even} \end{cases} \qquad(2)\]

Consider again the vector of numbers: \(x = 1, 2, 5, 3, 8\) with length \(n = 5\). To calculate the median, first order the vector: \(x = 1, 2, 3, 5, 8\), and then apply the corresponding formula (odd vs. even; here odd):

\[\widetilde{x}=x_{\frac{(5+1)}{2}}=x_3 = 3\]

How to calculate the median in R?

median(exVec)
[1] 3
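
Both cases of Equation 2 can be sketched as a small helper function. The name medianByHand is made up for this illustration, and the sketch assumes a numeric vector without missing values; in practice median() should be used.

```r
# Equation 2 as a helper function (hypothetical name, no NA handling)
medianByHand <- function(x) {
  xs <- sort(x)          # order the values first
  n <- length(xs)
  if (n %% 2 == 1) {
    xs[(n + 1) / 2]                      # odd n: the middle value
  } else {
    (xs[n / 2] + xs[n / 2 + 1]) / 2      # even n: mean of the two middle values
  }
}

exVec <- c(1, 2, 5, 3, 8)
medianByHand(exVec)           # odd case (n = 5)
medianByHand(c(exVec, 10))    # even case (n = 6): (3 + 5) / 2
```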

Variance

The variance is the expectation of the squared deviation of a random variable from its mean. A distinction is usually made between the population variance and the sample variance. The formula of the population variance is given in Equation 3.

\[VAR(X) = \sigma^2 = \frac{1}{N} \sum\limits_{i=1}^N (x_i - \mu)^2 \qquad(3)\]

The formula of the sample variance is given in Equation 4.

\[ VAR(X) = s^2 = \frac{1}{n-1} \sum\limits_{i=1}^n (x_i - \bar{x})^2 \qquad(4)\]

Using again the vector \(x = 1, 2, 5, 3, 8\), the sample variance is calculated as follows:

\[VAR(X) = s^2 = \frac{1}{4}((1-3.8)^2 + (2-3.8)^2 + (5-3.8)^2 + (3-3.8)^2 + (8-3.8)^2) = 7.7\]

How to calculate the variance in R?

var(exVec)
[1] 7.7
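
Note that var() computes the sample variance (Equation 4). The sketch below contrasts both formulas; the object names sampleVar and popVar are introduced here, and treating exVec as a complete population is purely an assumption for illustration.

```r
exVec <- c(1, 2, 5, 3, 8)
n <- length(exVec)

# Sample variance (Equation 4): divide by n - 1
sampleVar <- sum((exVec - mean(exVec))^2) / (n - 1)

# Population variance (Equation 3): divide by n,
# treating exVec as the full population
popVar <- sum((exVec - mean(exVec))^2) / n

sampleVar   # identical to var(exVec)
popVar      # equivalently var(exVec) * (n - 1) / n
```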

Standard Deviation

The standard deviation is defined as the square root of the variance. Again, a distinction is made between the population and the sample standard deviation. The formula of the population standard deviation is given in Equation 5.

\[SD(X) = \sigma = \sqrt{\sigma^2} \qquad(5)\]

The formula of the sample standard deviation is given in Equation 6.

\[SD(X) = s = \sqrt{s^2} \qquad(6)\]

Recall from the previous slide that the (sample) variance of the vector is \(7.7\). The sample standard deviation is therefore:

\[SD(X) = \sqrt{7.7}=2.774887\]

How to calculate the standard deviation in R?

sd(exVec)
[1] 2.774887
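
As with var(), sd() returns the sample version (Equation 6). A sketch of both variants follows; sampleSD and popSD are names introduced here, and the population version again assumes that exVec is the whole population.

```r
exVec <- c(1, 2, 5, 3, 8)
n <- length(exVec)

# Sample standard deviation (Equation 6): square root of the sample variance
sampleSD <- sqrt(var(exVec))

# Population standard deviation (Equation 5): rescale the variance to 1/n first
popSD <- sqrt(var(exVec) * (n - 1) / n)

sampleSD   # identical to sd(exVec)
popSD      # slightly smaller, because the divisor is n instead of n - 1
```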

Range

The range of a vector is the difference between the largest (maximum) and the smallest (minimum) values/observations.

\[Range(x) = R = x_{max}-x_{min} \qquad(7)\]

How to calculate the range in R?

range(exVec)
[1] 1 8

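Note that range() returns the minimum and the maximum rather than their difference. Both values can also be calculated separately, first the minimum…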

min(exVec)
[1] 1

… and the maximum.

max(exVec)
[1] 8

To compute the range itself, apply Equation 7:

max(exVec)-min(exVec)
[1] 7
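
A compact alternative, shown here only as a sketch, combines range() with diff(), which takes the difference between the two returned values:

```r
exVec <- c(1, 2, 5, 3, 8)

# range() returns c(min, max); diff() subtracts the first from the second
diff(range(exVec))
```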

Put everything together I

Recall, the dataset dat is the HSB dataset from the merTools package:

dat <- merTools::hsb

Calculating the mean, standard deviation, minimum and maximum for a set of variables:


myVar <- c("Math achievement" = "mathach",    # (1)
           "Gender" = "female",
           "Socioeconomic status" = "ses",
           "Class size" = "size")

exDescr <- apply(                             # (2)
  X = dat[, myVar],                           # (3)
  MARGIN = 2,                                 # (4)
  FUN = function(x) {                         # (5)
    ret <- c(                                 # (6)
             mean(x, na.rm = TRUE),
             sd(x, na.rm = TRUE),
             min(x, na.rm = TRUE),
             max(x, na.rm = TRUE)
             )
    return(ret)                               # (7)
    })

  1. Create a (named) character vector of the variables by using the c() function.
  2. Use the apply() function to apply one or more functions to the data (here: 4 columns).
  3. The input is the dataset with the selected columns of interest (see 1.).
  4. MARGIN = 2 indicates that the function should be applied over columns.
  5. Create the function that should be applied. Here we calculate the mean(), sd(), min() and max().
  6. Create a temporary R object that is returned later (here: the vector ret).
  7. Return the temporary object and close the function and the apply() call.
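
To make the role of MARGIN more concrete, the following sketch applies a function over columns and over rows of a small made-up data frame (toy, colStats and rowMeansToy are hypothetical names, not part of the HSB example):

```r
# Hypothetical toy data: 3 units (rows), 2 variables (columns)
toy <- data.frame(a = c(1, 2, 3),
                  b = c(10, 20, NA))

# MARGIN = 2: the function is called once per column (variable)
colStats <- apply(X = toy, MARGIN = 2,
                  FUN = function(x) c(mean(x, na.rm = TRUE),
                                      max(x, na.rm = TRUE)))

# MARGIN = 1: the function is called once per row (unit)
rowMeansToy <- apply(X = toy, MARGIN = 1,
                     FUN = function(x) mean(x, na.rm = TRUE))

colStats      # 2 x 2 matrix: one column per variable
rowMeansToy   # one value per unit
```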

Put everything together II

Print the results…

exDescr |>
  print()
       mathach    female           ses      size
[1,] 12.747853 0.5281837  0.0001433542 1056.8618
[2,]  6.878246 0.4992398  0.7793551951  604.1725
[3,] -2.832000 0.0000000 -3.7580000000  100.0000
[4,] 24.993000 1.0000000  2.6920000000 2713.0000


This is an awkward format; variables should be in rows, not columns. Transpose…

exDescr |>
  t() |>
  print()
                [,1]        [,2]    [,3]     [,4]
mathach 1.274785e+01   6.8782457  -2.832   24.993
female  5.281837e-01   0.4992398   0.000    1.000
ses     1.433542e-04   0.7793552  -3.758    2.692
size    1.056862e+03 604.1724993 100.000 2713.000


Better, but still not really convincing…
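
One possible base R improvement, sketched here on made-up data (toy and descrToy are hypothetical names), is to name the returned statistics and round the transposed result. This is only an intermediate step and does not replace a properly formatted table.

```r
# Hypothetical toy data standing in for the selected columns of dat
toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(10.5, 20.25, 30, 45))

descrToy <- apply(X = toy, MARGIN = 2,
                  FUN = function(x) {
                    # naming the elements yields row names in the result
                    c(Mean = mean(x, na.rm = TRUE),
                      SD   = sd(x, na.rm = TRUE),
                      Min  = min(x, na.rm = TRUE),
                      Max  = max(x, na.rm = TRUE))
                  })

# Transpose (variables in rows) and round to two decimals
descrToy |>
  t() |>
  round(digits = 2)
```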

Making a table I


exDescrTab <- exDescr |>                          # (1)
    t() |>                                        # (2)
    as.data.frame() |>
    (\(d) cbind(names(myVar), d))() |>            # (3)
    flextable() |>                                # (4)
    theme_apa() |>                                # (5)
    set_header_labels(                            # (6)
      "names(myVar)" = "Variables",
      V1 = "Mean",
      V2 = "SD",
      V3 = "Min",
      V4 = "Max") |>
    align(part = "body", align = "c") |>          # (7)
    align(j = 1, part = "all", align = "l") |>
    add_footer_lines(                             # (8)
      as_paragraph(as_i("Note. "),
                   "This is a footnote.")
      ) |>
    align(align = "left", part = "footer") |>
    width(j = 1, width = 2, unit = "in") |>       # (9)
    width(j = 2:5, width = 1, unit = "in")

  1. Take the results (here: the exDescr object)…
  2. …transpose them (using the t() function) and coerce the result to a data.frame object (as.data.frame()).
  3. Use an anonymous (lambda) function to bind (using the cbind() function) the variable names as the first column to the dataset.
  4. Apply the flextable() function.
  5. Use the APA theme (theme_apa()).
  6. Rename the column headers (set_header_labels()).
  7. Center the body of the table and left-align the first column (align()).
  8. Add a footnote (add_footer_lines()) and align it to the left.
  9. Change the column widths (width()) to 2 inches for the first column and 1 inch for the others.
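
Steps 2 and 3 can be tried in isolation, without flextable. The sketch below uses made-up values and hypothetical names (toyDescr, toyVar, toyTab). Unlike the code above, it names the new column directly inside cbind(). It assumes R 4.1 or later, which the native pipe and the \(d) shorthand require.

```r
# Hypothetical toy results: 2 statistics (rows) for 2 variables (columns)
toyDescr <- matrix(c(2.5, 1.29,
                     26.44, 14.6),
                   nrow = 2,
                   dimnames = list(NULL, c("a", "b")))
toyVar <- c("Variable A" = "a", "Variable B" = "b")

toyTab <- toyDescr |>
  t() |>                 # variables in rows
  as.data.frame() |>     # unnamed columns become V1, V2
  (\(d) cbind(Variable = names(toyVar), d))()   # labels as first column

toyTab
```

Wrapping the anonymous function in parentheses and calling it with () is what allows it to appear on the right-hand side of the native pipe.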

Making a table II

Print the table.

exDescrTab
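Table 1: Descriptive statistics

| Variables            | Mean     | SD     | Min    | Max      |
|:---------------------|:--------:|:------:|:------:|:--------:|
| Math achievement     | 12.75    | 6.88   | -2.83  | 24.99    |
| Gender               | 0.53     | 0.50   | 0.00   | 1.00     |
| Socioeconomic status | 0.00     | 0.78   | -3.76  | 2.69     |
| Class size           | 1,056.86 | 604.17 | 100.00 | 2,713.00 |

Note. This is a footnote.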

Table export

If you want to export the table…

exDescrTab |>
  set_caption(caption = "Table X.\nDescriptive statistics") |>
  save_as_docx(path = "descr-tab.docx")

Descriptive statistics with the psych package

  • Alternatively, it is convenient to use additional R packages such as the psych package (Revelle, 2023) to calculate descriptive statistics

  • Here we use the describe function (with the fast argument set to TRUE) to calculate the descriptive statistics of all variables within the example data set

dat |>
  subset(select = -c(1)) |>
  psych::describe(fast = TRUE) |>
  flextable() 
Table 2: Descriptive statistics with the psych package

| vars | n     | mean            | sd          | median    | min     | max       | range     | skew        | kurtosis   | se          |
|-----:|------:|----------------:|------------:|----------:|--------:|----------:|----------:|------------:|-----------:|------------:|
| 1    | 7,185 | 0.2747390397    | 0.4464137   | 0.000     | 0.000   | 1.000     | 1.000     | 1.00906215  | -0.9819302 | 0.005266525 |
| 2    | 7,185 | 0.5281837161    | 0.4992398   | 1.000     | 0.000   | 1.000     | 1.000     | -0.11289082 | -1.9875322 | 0.005889736 |
| 3    | 7,185 | 0.0001433542    | 0.7793552   | 0.002     | -3.758  | 2.692     | 6.450     | -0.22809706 | -0.3804498 | 0.009194372 |
| 4    | 7,185 | 12.7478526096   | 6.8782457   | 13.131    | -2.832  | 24.993    | 27.825    | -0.18049184 | -0.9215987 | 0.081145473 |
| 5    | 7,185 | 1,056.8617954071 | 604.1724993 | 1,016.000 | 100.000 | 2,713.000 | 2,613.000 | 0.57149608  | -0.3649453 | 7.127669715 |
| 6    | 7,185 | 0.4931106472    | 0.4999873   | 0.000     | 0.000   | 1.000     | 1.000     | 0.02755427  | -1.9995190 | 0.005898555 |
| 7    | 7,185 | 0.0061384830    | 0.4135539   | 0.038     | -1.188  | 0.831     | 2.019     | -0.26812150 | -0.4797392 | 0.004878864 |
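
Two of the derived columns in Table 2 can be checked by hand. The sketch below copies the values of row 4 (math achievement) from the table; the object names are introduced here, and se is taken to be the standard error of the mean, i.e. the standard deviation divided by the square root of n.

```r
# Values of row 4 (math achievement) copied from Table 2
n    <- 7185
sdX  <- 6.8782457
minX <- -2.832
maxX <- 24.993

sdX / sqrt(n)    # standard error of the mean (column se)
maxX - minX      # range (column range), see Equation 7
```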

Exercise

Style the table according to your own preferences and requirements.

References

Eid, M., Gollwitzer, M., & Schmitt, M. (2013). Statistik und Forschungsmethoden: Lehrbuch ; mit Online-Materialien (3., korrigierte Auflage). Beltz.
Gohel, D., & Skintzos, P. (2023). Flextable: Functions for tabular reporting. https://ardata-fr.github.io/flextable-book/
Revelle, W. (2023). Psych: Procedures for psychological, psychometric, and personality research. https://personality-project.org/r/psych/ https://personality-project.org/r/psych-manual.pdf
Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., Woo, K., Yutani, H., & Dunnington, D. (2023). ggplot2: Create elegant data visualisations using the grammar of graphics. https://ggplot2.tidyverse.org
Xie, Y. (2023). Knitr: A general-purpose package for dynamic report generation in R. https://yihui.org/knitr/